feat(upload): idempotent retry of the last committed chunk#145

Merged
rubenhensen merged 2 commits into main from feat/idempotent-chunk-retry
May 7, 2026

Conversation

@rubenhensen
Contributor

Summary

When a chunk PUT response is lost in flight, the client retains the previous CryptifyToken (the one it sent), but the server has already advanced the rolling-token chain. Without server-side help, retry is impossible: the naïve resend hits "token mismatch" or "wrong offset" depending on what the client tries.

This PR adds the missing protocol piece — server-side detection of duplicate retries — so client-side retry (coming next in postguard-js) can actually work.

How it works

The server now caches the most recently committed chunk's (prev_token, prev_uploaded, chunk_len, chunk_sha256, response_token) on the FileState (in-memory for now; will migrate to Postgres in Phase 2). The chunk handler is restructured around a classify_chunk_request helper with three outcomes:

  • NormalNext — state.uploaded == start && header_token == current_token. Write, advance, refresh last_chunk.
  • ReplayLastChunk — every cached field matches AND the body's SHA-256 matches the stored hash. Return the cached response_token, no write, no double-count.
  • Reject — regular 400s. Stricter 400s when the request looks like a retry (matching prev_token + offset) but the body or length differ — never accept different bytes for the same offset.

The body is now read once at the top of the handler so the same bytes feed both the replay-detection hash and the normal write path. The per-upload limit and file-open checks still run only on the normal-write path; on a retry the original PUT already passed them.
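The three outcomes above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `classify_chunk_request`, `FileState`, and the cached-field names come from the description, but the exact types and signatures here are assumptions for illustration.

```rust
#[derive(Debug, PartialEq)]
enum ChunkClass {
    NormalNext,
    ReplayLastChunk { cached_response_token: String },
    Reject { msg: &'static str },
}

// Cached record of the most recently committed chunk (assumed shape).
struct LastChunk {
    prev_token: String,
    prev_uploaded: u64,
    chunk_len: u64,
    chunk_sha256: [u8; 32],
    response_token: String,
}

struct FileState {
    uploaded: u64,
    cryptify_token: String,
    last_chunk: Option<LastChunk>,
}

fn classify_chunk_request(
    state: &FileState,
    header_token: &str,
    start: u64,
    body_len: u64,
    body_sha256: [u8; 32],
) -> ChunkClass {
    // Normal next chunk: correct offset and the current rolling token.
    if state.uploaded == start && header_token == state.cryptify_token {
        return ChunkClass::NormalNext;
    }
    if let Some(last) = &state.last_chunk {
        if header_token == last.prev_token && start == last.prev_uploaded {
            // Looks like a retry of the last committed chunk: replay only
            // if the body is provably byte-identical to what was committed.
            if body_len == last.chunk_len && body_sha256 == last.chunk_sha256 {
                return ChunkClass::ReplayLastChunk {
                    cached_response_token: last.response_token.clone(),
                };
            }
            // Same (prev_token, offset) but different bytes: strict reject.
            return ChunkClass::Reject {
                msg: "retry body differs from committed chunk",
            };
        }
    }
    ChunkClass::Reject {
        msg: "Content-Range or token mismatch",
    }
}
```

Note how the strict reject sits inside the replay-candidate branch: once a request matches the cached `(prev_token, prev_uploaded)` pair, different bytes for the same offset can never be accepted.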

Wire compatibility

Existing clients that don't retry see no change. The contract only kicks in when a client sends (prev_token, prev_uploaded) matching the cached record — old clients never do. Documented in api-description.yaml so future clients (postguard-js v1.4 in particular) can rely on it.

Test plan

Seven new unit tests cover the classifier:

  • classify_normal_next_chunk — happy path
  • classify_replays_last_chunk_on_matching_retry — full match returns cached token
  • classify_rejects_retry_with_different_body — same prev_token + offset, body bytes differ → 400
  • classify_rejects_retry_with_different_length — same prev_token + offset, length differs → 400
  • classify_rejects_offset_mismatch_with_no_replay — no last_chunk cached → regular 400
  • classify_rejects_token_mismatch_at_correct_offset — right offset, wrong token, no replay match → 400 with the existing token-mismatch message
  • classify_does_not_replay_when_prev_token_does_not_match — last_chunk cached but a different prev_token → falls through to regular Content-Range 400

CI gate:

  • cargo fmt --check passes
  • cargo clippy --all-targets -- -D warnings passes
  • cargo test --all-targets passes (49 tests, none regressed)

Refs

The deliberately-deferred status-endpoint design from #136 (cross-refresh resume) will be filed as a separate issue once Phase 2 (Postgres-backed sessions) is in place to support it.

When a chunk PUT response is lost in flight the client retains the old
CryptifyToken (the one it sent) but the server has already advanced the
rolling-token chain. Without server-side help, retry is impossible: the
naïve resend hits 'token mismatch' or 'wrong offset' depending on what
the client tries.

Server now caches the just-committed chunk's
(prev_token, prev_uploaded, chunk_len, chunk_sha256, response_token) on
the FileState. The chunk handler is restructured around a
`classify_chunk_request` helper with three outcomes:

- NormalNext: state.uploaded == start && header_token == current_token
  → write, advance, refresh last_chunk
- ReplayLastChunk: every cached field matches AND body sha256 matches
  → return cached response_token, no write, no double-count
- Reject: regular 400s. Stricter 400s when the request looks like a
  retry but the body or length differ — never accept different bytes
  for the same offset.

The body is now read once at the top of the handler so the same bytes
feed both the replay-detection hash and the normal write path. The
per-upload limit and file-open are still checked only on the
normal-write path; on retry the original PUT already passed those.

api-description.yaml documents the contract on PUT /fileupload/{uuid}.

Tests: seven new unit tests cover NormalNext, ReplayLastChunk match,
length mismatch, body-hash mismatch, offset mismatch with no replay
record, token mismatch at correct offset, and prev_token mismatch with
a replay record present (falls through to regular Content-Range 400).

Refs #136 (resume protocol design from
@dobby-coder's audit), encryption4all/postguard-website#117.
@rubenhensen
Contributor Author

/dobby review this

@dobby-coder
Contributor

dobby-coder Bot commented May 7, 2026

Dobby is here, Master Ruben! Dobby reads the request and Dobby will review the chunk-retry protocol with great care! Dobby will check the classifier logic, the tests, and the wire-compat story — then Dobby will leave a proper review!

@dobby-coder dobby-coder Bot left a comment

Review

Solid PR. The classifier abstraction reads well, the test coverage is thorough (especially the negative cases), and the wire-compat story checks out — old clients never present a (prev_token, prev_uploaded) pair that matches a cached last_chunk, so they fall straight through to the existing paths.

I built and ran the suite locally on feat/idempotent-chunk-retry: cargo test --all-targets → 49 passed, 0 failed. CI's Rust-quality job is green; the amd64/arm64 image builds were still pending at review time.

Things I'd consider before merge

  1. Body is now read before the per-upload-limit and token checks. Pre-PR ordering rejected a request with the wrong cryptifytoken or wrong offset before consuming any body. Now the body — up to chunk_size (~1 MiB) — is always read into memory first so the classifier can hash it. UUIDs are unguessable, so this is narrow, but it does slightly widen the work an attacker with a leaked UUID can force per request. Probably acceptable; flagging in case you want to gate the body read behind a cheap pre-check (e.g., only read the body when state.uploaded == start or last_chunk.prev_uploaded == start).

  2. chunk_sha256 is arguably redundant. compute_hash(last.prev_token, body) == last.response_token already proves the body is byte-identical (modulo a SHA-256 collision, which would also break the rolling-token chain itself). You could drop chunk_sha256 (and chunk_len, since body.len() is implicit in the hash input) from LastChunkRecord and verify replay by recomputing the rolling hash. Two upsides: smaller FileState, and one less invariant to keep in sync when Phase 2 moves this to Postgres. Downside: the current explicit fields are more obvious to a reader. Judgment call — not a blocker.
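Point 2 can be illustrated with a small sketch. `compute_hash` below is a stand-in built on std's `DefaultHasher` rather than the real SHA-256 rolling hash, but the construction is the same: hash over `(prev_token, body)` and compare the result to the cached `response_token`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the SHA-256-based rolling hash: hashes prev_token
// together with the body bytes. Only the construction matters here.
fn compute_hash(prev_token: u64, body: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    prev_token.hash(&mut h);
    body.hash(&mut h);
    h.finish()
}

// Replay check without chunk_sha256 or chunk_len: recompute the rolling
// hash over the retried body and compare it to the cached response_token.
// A different body, or a different length, yields a different hash.
fn is_replay(cached_prev_token: u64, cached_response_token: u64, body: &[u8]) -> bool {
    compute_hash(cached_prev_token, body) == cached_response_token
}
```

A length mismatch surfaces the same way as a content mismatch, which is why the dedicated length field (and its test) can be dropped.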

  3. Duplicated literal "Cryptify Token header does not match" in the classifier (src/main.rs:615) vs. check_cryptify_token (still used by finalize). The comment notes the wording is intentional; a const TOKEN_MISMATCH_MSG: &str referenced by both call sites would prevent silent drift if the message ever changes.

  4. Minor: two shasum.clone() calls (src/main.rs:546 and :553) where one would do — the final CryptifyToken(shasum) could take the moved value and state.cryptify_token / response_token could share a single clone.

  5. Phase 2 note (FYI, not for this PR): when last_chunk migrates to Postgres, response_token becomes a row that must be written atomically with the file write + state.uploaded advance, otherwise a server crash between "write file" and "persist replay record" reintroduces exactly the scenario this PR fixes. Worth keeping in mind for the migration design — probably wants the chunk write to be idempotent at the storage layer too, not just the response-replay layer.

Nothing here is blocking. Approve from this agent's side once CI's image builds finish.

Four small follow-ups from @dobby-coder's review:

1. Cheap pre-check before reading the chunk body. Mirrors the structural
   part of `classify_chunk_request` (normal-next OR replay-candidate)
   and rejects mismatches before consuming up to `chunk_size` bytes,
   restoring the prior security profile against UUIDs leaked to an
   attacker.

2. Drop `chunk_sha256` and `chunk_len` from `LastChunkRecord`. Body
   identity falls out of the rolling-token chain itself: recompute
   `compute_hash(prev_token, body)` and compare to `response_token`.
   Length divergence surfaces as a hash mismatch — one less invariant
   to keep in sync, smaller `FileState`, simpler Phase 2 schema. The
   length-specific test folds into the body-mismatch test.

3. Promote the duplicated "Cryptify Token header does not match"
   string to a `TOKEN_MISMATCH_MSG` const referenced by both
   `check_cryptify_token` (finalize path) and the chunk classifier so
   the wording can't drift silently between call sites.

4. One fewer `shasum.clone()` — the final responder takes the moved
   value while `state.cryptify_token` and `last_chunk.response_token`
   share a single clone.

`sha256_of` is removed (sole caller is gone). Tests gain a
`last_chunk_for(prev_token, offset, body)` helper that builds a
`LastChunkRecord` whose `response_token` is the actual rolling hash,
matching the production handler's invariant.

Refs #145 review.
@rubenhensen
Contributor Author

Thanks @dobby-coder. Pushed 2a6dac1 addressing #1–#4:

  1. DoS gate before body read — handler now does the cheap structural check (normal-next OR replay-candidate) before consuming the body, restoring the prior security profile. Reuses the same predicates classify_chunk_request does so the two stay in sync.
  2. Dropped chunk_sha256 and chunk_len — replay verification is now compute_hash(prev_token, body) == response_token, which is the same construction the rolling-token chain itself relies on. Length divergence falls out as a hash mismatch. The dedicated length test folds into the body-mismatch test.
  3. TOKEN_MISMATCH_MSG const — referenced by both check_cryptify_token (finalize) and the classifier.
  4. One fewer shasum.clone() — the final responder takes the moved value.
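The "cheap structural check before body read" from item 1 might look roughly like this. The state shape is an assumption (a minimal stand-in for the relevant parts of FileState), but the predicates mirror the two accepting branches of the classifier.

```rust
// Minimal stand-in for the parts of FileState the pre-check needs.
struct PreCheckState {
    uploaded: u64,
    current_token: String,
    last_prev_token: Option<String>,
    last_prev_uploaded: Option<u64>,
}

// Evaluated before any body bytes are consumed: the request must be
// either the normal next chunk or a candidate replay of the last
// committed one. Everything else is rejected without reading (and
// hashing) up to chunk_size bytes of body.
fn may_read_body(state: &PreCheckState, header_token: &str, start: u64) -> bool {
    let normal_next = state.uploaded == start && header_token == state.current_token;
    let replay_candidate = state.last_prev_token.as_deref() == Some(header_token)
        && state.last_prev_uploaded == Some(start);
    normal_next || replay_candidate
}
```

Because the gate reuses the same predicates the classifier evaluates, a request that passes it still goes through the full (hash-checked) classification afterwards; the gate only restores the pre-PR property that structurally invalid requests cost no body read.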

Skipping #5 here since it's a Phase 2 concern. Filing it as a tracking note on the Postgres-sessions design (the persist-and-advance ordering will need to be transactional with the file write to keep the same guarantees).

49 tests pass, fmt + clippy clean.

@rubenhensen rubenhensen merged commit 13661c7 into main May 7, 2026
5 checks passed
@rubenhensen rubenhensen deleted the feat/idempotent-chunk-retry branch May 7, 2026 11:33
@github-actions github-actions Bot mentioned this pull request May 7, 2026
rubenhensen added a commit to encryption4all/postguard-website that referenced this pull request May 7, 2026
Bumps `@e4a/pg-js` from `^1.3.0` to `^1.4.0` to pick up the new
upload/download retry machinery (chunk PUTs and download GETs retry
with exponential backoff + full jitter on 5xx and network errors).
The bump itself is a clean semver minor — additive config, no API
changes for the website.

UX additions on top of the bump:

- **Retry indicator.** `src/lib/postguard.ts` now wires an `onRetry`
  callback into the SDK that updates a small Svelte store
  (`retryStatus`). Both `SendButton.svelte` (upload) and
  `download/+page.svelte` subscribe to it and render a single line
  ("Connection hiccup, retrying… (attempt N of M)") underneath the
  active spinner during the retry window. Cleared on success or
  terminal error so a stale event can't leak between operations.

- **`UploadSessionExpiredError` distinguished from generic 5xx.** The
  SDK now raises this dedicated error when cryptify reports
  `upload_session_not_found` (idle past the configured TTL or the
  server restarted). On upload it surfaces a regular Error state
  with `serverError = false` so the existing copy doesn't blame the
  server for a state-loss problem retry can't fix; on download it
  gets its own state and copy ("Upload session expired" / "Ask the
  sender to send the files once more"). Localised in en + nl.

Also unblocks the husky pre-commit hook: the repo had
`prettier-plugin-svelte` as a devDependency but never registered it
in the prettier config, so `prettier --write` on staged paths failed
on .svelte files (see #143's bypass note). The plugin is now wired
up via the `plugins` field in the package.json `prettier` block, and
all 41 .svelte files were reformatted to apply the resulting style
(mostly stray semicolons + minor whitespace). Six pre-existing
`eslint-disable-next-line` comments that prettier broke by wrapping
their target tags onto multiple lines were widened to
`eslint-disable` / `eslint-enable` blocks.

Refs: postguard-website#117, encryption4all/cryptify#136,
encryption4all/cryptify#145, encryption4all/postguard-js#47.
